Factors

Defining Categorical Types

Rodney Dyer, PhD

Topics Covered

In this brief presentation, we’ll be introducing the following items:

  • Factor Data Types
  • The forcats library
  • Workflows & Pipes
  • Tables

Categorical Data

Categorical Data Types

 

Unique, distinct groupings that can be applied to a study design.

  • Case sensitive
  • Can be ordinal
  • Typically defined as character type
weekdays <- c("Monday","Tuesday","Wednesday",
              "Thursday","Friday","Saturday", 
              "Sunday")
class( weekdays )
[1] "character"
weekdays
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   

Making Up Data 🤷🏻

The function sample() allows us to take a random sample of elements from a vector of potential values.

chooseOne <- sample( c("Heads","Tails"), size=1 )
chooseOne
[1] "Tails"

Making Up More Data 🤷🏻

However, if we want a large number of items, we can sample them with or without replacement.

sample( c("Heads","Tails"), size=10, replace=TRUE )
 [1] "Tails" "Heads" "Tails" "Tails" "Tails" "Tails" "Tails" "Heads" "Tails"
[10] "Tails"
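Because sample() draws at random, the output shown here will differ from run to run. The following is a minimal sketch of making the draws repeatable with set.seed(), and of what sampling without replacement implies. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
# Fix the random number generator so the same draws come back each run.
set.seed(42)
flips <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)

# Re-seeding with the same value reproduces the identical sequence.
set.seed(42)
flips_again <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)
identical(flips, flips_again)

# Without replacement, each element can be drawn at most once,
# so sampling every element simply shuffles the vector.
shuffled <- sample(c("A", "B", "C", "D"))
sort(shuffled)

# Asking for more items than exist, without replacement, is an error.
too_many <- tryCatch(
  sample(c("Heads", "Tails"), size = 10, replace = FALSE),
  error = function(e) "error"
)
too_many
```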

Weekdays as example

We’ll pretend we have a bunch of data related to the day of the week.

days <- sample( weekdays, size=40, replace=TRUE)
summary( days )
   Length     Class      Mode 
       40 character character 
days
 [1] "Sunday"    "Wednesday" "Friday"    "Wednesday" "Friday"    "Monday"   
 [7] "Friday"    "Friday"    "Saturday"  "Tuesday"   "Sunday"    "Wednesday"
[13] "Thursday"  "Thursday"  "Tuesday"   "Monday"    "Monday"    "Thursday" 
[19] "Saturday"  "Sunday"    "Wednesday" "Saturday"  "Wednesday" "Friday"   
[25] "Saturday"  "Monday"    "Friday"    "Sunday"    "Monday"    "Sunday"   
[31] "Sunday"    "Monday"    "Monday"    "Sunday"    "Sunday"    "Monday"   
[37] "Sunday"    "Friday"    "Tuesday"   "Thursday" 

Turn it into a factor

data <- factor( days )
is.factor( data )
[1] TRUE
class( data )
[1] "factor"

Data Type Specific Printing & Summaries

data
 [1] Sunday    Wednesday Friday    Wednesday Friday    Monday    Friday   
 [8] Friday    Saturday  Tuesday   Sunday    Wednesday Thursday  Thursday 
[15] Tuesday   Monday    Monday    Thursday  Saturday  Sunday    Wednesday
[22] Saturday  Wednesday Friday    Saturday  Monday    Friday    Sunday   
[29] Monday    Sunday    Sunday    Monday    Monday    Sunday    Sunday   
[36] Monday    Sunday    Friday    Tuesday   Thursday 
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday

Factor Levels

Each factor variable is defined by the levels that constitute the data. This is a finite set of unique values.

levels( data)
[1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
[7] "Wednesday"
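It may help to see how a factor relates its levels to the observations. The sketch below uses a small made-up vector (not the 40 sampled days) to show that each observation is stored as an integer index into the level set.

```r
# A short, invented vector of day names.
days <- c("Monday", "Friday", "Monday", "Sunday")
f <- factor(days)

levels(f)       # the unique values, sorted alphabetically by default
as.integer(f)   # each observation stored as an index into levels(f)
nlevels(f)      # how many levels there are

# The original labels can be recovered from the integer codes.
recovered <- levels(f)[as.integer(f)]
identical(recovered, days)
```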

Ordinal Categorical Data

Factor Ordination

If a factor is not ordinal, it does not allow the use of relational comparison operators.

data[1] < data[2]
Warning in Ops.factor(data[1], data[2]): '<' not meaningful for factors
[1] NA

Ordination = Ordered

is.ordered( data )
[1] FALSE

Ordination of Factors

Where ordination matters:

  • Fertilizer Treatments in kg of N2 per hectare: 10 kg N2, 20 kg N2, 30 kg N2,

  • Days of the Week: Friday is not followed by Monday,

  • Life History Stage: seed, seedling, juvenile, adult, etc.

Where ordination is irrelevant:

  • River

  • State or Region

  • Sample Location

Making Ordered Factors

data <- factor( days, ordered = TRUE)
is.ordered( data )
[1] TRUE
data
 [1] Sunday    Wednesday Friday    Wednesday Friday    Monday    Friday   
 [8] Friday    Saturday  Tuesday   Sunday    Wednesday Thursday  Thursday 
[15] Tuesday   Monday    Monday    Thursday  Saturday  Sunday    Wednesday
[22] Saturday  Wednesday Friday    Saturday  Monday    Friday    Sunday   
[29] Monday    Sunday    Sunday    Monday    Monday    Sunday    Sunday   
[36] Monday    Sunday    Friday    Tuesday   Thursday 
7 Levels: Friday < Monday < Saturday < Sunday < Thursday < ... < Wednesday

The problem is that the default ordering is actually alphabetical!

Specifying the Order

Specifying the Order of Ordinal Factors

data <- factor( days, ordered = TRUE, levels = weekdays)
data
 [1] Sunday    Wednesday Friday    Wednesday Friday    Monday    Friday   
 [8] Friday    Saturday  Tuesday   Sunday    Wednesday Thursday  Thursday 
[15] Tuesday   Monday    Monday    Thursday  Saturday  Sunday    Wednesday
[22] Saturday  Wednesday Friday    Saturday  Monday    Friday    Sunday   
[29] Monday    Sunday    Sunday    Monday    Monday    Sunday    Sunday   
[36] Monday    Sunday    Friday    Tuesday   Thursday 
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
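Once the levels are ordered, relational operators become meaningful. A minimal sketch, using a two-element factor made up for illustration (Sunday and Wednesday, compared in Monday-first order):

```r
weekdays <- c("Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday",
              "Sunday")

# Two observations sharing the same ordered level set.
pair <- factor(c("Sunday", "Wednesday"), levels = weekdays, ordered = TRUE)

pair[1] < pair[2]         # is Sunday before Wednesday in this ordering?
pair[1] > pair[2]         # is Sunday after Wednesday?
as.character(max(pair))   # the latest day in the pair
```

With this level order Sunday is the last level, so it compares as greater than Wednesday.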

Sorting Is Now Relevant

sort( data )
 [1] Monday    Monday    Monday    Monday    Monday    Monday    Monday   
 [8] Monday    Tuesday   Tuesday   Tuesday   Wednesday Wednesday Wednesday
[15] Wednesday Wednesday Thursday  Thursday  Thursday  Thursday  Friday   
[22] Friday    Friday    Friday    Friday    Friday    Friday    Saturday 
[29] Saturday  Saturday  Saturday  Sunday    Sunday    Sunday    Sunday   
[36] Sunday    Sunday    Sunday    Sunday    Sunday   
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday

Fixed Set of Levels

You cannot assign a value to a factor that is not one of the pre-defined levels.

data[3] <- "Bob"
Warning in `[<-.factor`(`*tmp*`, 3, value = "Bob"): invalid factor level, NA
generated
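If a new category really is needed, the level set has to be extended before the assignment. The sketch below shows one base R way to do this on a small invented factor; it is an illustration, not the only approach.

```r
f <- factor(c("Monday", "Tuesday", "Monday"))

# Assigning an unknown value produces NA (the warning is suppressed here).
g <- f
suppressWarnings(g[3] <- "Bob")
is.na(g[3])

# Extending the levels first makes the same assignment valid.
h <- f
levels(h) <- c(levels(h), "Bob")
h[3] <- "Bob"
as.character(h)
```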

forcats

The forcats library

Part of the tidyverse group of packages.

library(forcats)

This library has a lot of helper functions that make working with factors a bit easier. I’m going to give you a few examples here but strongly encourage you to look at the cheat sheet for all the other options.

Counting Factors

fct_count( data )
# A tibble: 8 × 2
  f             n
  <fct>     <int>
1 Monday        8
2 Tuesday       3
3 Wednesday     5
4 Thursday      4
5 Friday        6
6 Saturday      4
7 Sunday        9
8 <NA>          1
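For comparison, base R's table() gives similar counts without forcats. This is a small sketch on an invented vector x; note that table() leaves out NA unless asked to keep it, whereas the fct_count() output above lists it.

```r
# A made-up factor containing one missing value.
x <- factor(c("a", "b", "a", NA, "c", "a"))

table(x)                             # NA is excluded by default

counts <- table(x, useNA = "ifany")  # keep NA as its own category
counts

# With NA included, the counts add up to the number of observations.
sum(counts) == length(x)
```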

Lumping Factors

lumped <- fct_lump_min( data, min = 5)
fct_count(  lumped )
# A tibble: 6 × 2
  f             n
  <fct>     <int>
1 Monday        8
2 Wednesday     5
3 Friday        6
4 Sunday        9
5 Other        11
6 <NA>          1

Reordering Factors

We can reorder levels by order of appearance, by number of observations (frequency), or by numeric value.

By Frequency

freq <- fct_infreq( data )
levels( freq )
[1] "Sunday"    "Monday"    "Friday"    "Wednesday" "Thursday"  "Saturday" 
[7] "Tuesday"  

By Order of Appearance

ordered <- fct_inorder( data )
levels( ordered )
[1] "Sunday"    "Wednesday" "Friday"    "Monday"    "Saturday"  "Tuesday"  
[7] "Thursday" 

Reorder Specific Levels

newWeek <- fct_relevel( data, "Saturday", "Sunday")
levels( newWeek )
[1] "Saturday"  "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday" 
[7] "Friday"   

Dropping Missing Levels

data <- sample( weekdays[1:5], size=40, replace=TRUE )
data <- factor( data, ordered=TRUE, levels = weekdays )
summary( data )
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
        4         7        10         7        12         0         0 

Dropping Missing Levels

fct_drop( data ) -> dropped
summary( dropped )
   Monday   Tuesday Wednesday  Thursday    Friday 
        4         7        10         7        12 
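Base R offers droplevels() for the same job. A minimal sketch on a short invented factor, assuming the same Monday-first level order used throughout:

```r
weekdays <- c("Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday",
              "Sunday")

# Only two of the seven declared levels actually occur.
f <- factor(c("Monday", "Friday", "Monday"), levels = weekdays, ordered = TRUE)
nlevels(f)

# droplevels() removes the unused levels and keeps the ordering.
g <- droplevels(f)
levels(g)
is.ordered(g)
```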

Example iris Data

Ronald Aylmer Fisher
1890 - 1962

British polymath, mathematician, statistician, geneticist, and academic. Originated or developed ideas such as:

  • Fundamental Theorem of Natural Selection,
  • The F test,
  • The Exact test,
  • Linear Discriminant Analysis,
  • Inverse probability
  • Intra-class correlations
  • Sexy son hypothesis…. 🥰

head( iris )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Operating on Factor Levels

Question: What are the mean and variance in sepal length for each of the Iris species?

The by() function allows us to perform some function on data based upon a grouping index.

by( data, index, function )

by()

Here we can apply the function mean() to the data on sepal length using the species factor as a category.

by( iris$Sepal.Length, iris$Species, mean)
iris$Species: setosa
[1] 5.006
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.936
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.588

by()

The same for estimating variance

by( iris$Sepal.Length, iris$Species, var)
iris$Species: setosa
[1] 0.124249
------------------------------------------------------------ 
iris$Species: versicolor
[1] 0.2664327
------------------------------------------------------------ 
iris$Species: virginica
[1] 0.4043429
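Both statistics can also be collected in one pass. The sketch below is one possible base R alternative to calling by() twice, using split() and sapply(); the names groups and stats are chosen here for illustration.

```r
# Split sepal lengths into one vector per species.
groups <- split(iris$Sepal.Length, iris$Species)

# For each species, return a named pair of summary statistics.
stats <- sapply(groups, function(x) c(mean = mean(x), var = var(x)))

# Rows are statistics, columns are species.
round(stats, 3)
```

The result is a plain numeric matrix, which can be indexed directly, for example stats["mean", "setosa"].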

Missing Data

Missing data is a fact of life and R is very opinionated about how it handles missing values. Where this becomes tricky is when we are doing operations on data that has missing values. R could take two routes:

  1. It could ignore the data and give you the answer directly as if the data were not missing, or
  2. It could let you know that there is missing data and make you do something about it.

Fortunately, R took the second route.

Example

df <- iris[, 4:5]
df$Petal.Width[ c(2,6,12) ] <- NA
summary( df )
  Petal.Width          Species  
 Min.   :0.100   setosa    :50  
 1st Qu.:0.300   versicolor:50  
 Median :1.300   virginica :50  
 Mean   :1.218                  
 3rd Qu.:1.800                  
 Max.   :2.500                  
 NA's   :3                      

Screaming About NA

If there is even ONE NA, most mathematical operations will return NA instead of an answer.

mean( df$Petal.Width )
[1] NA

To tell R that you are both aware and ok with the estimation of a mean value in the presence of missing data, you will have to change an optional argument passed to the function.

mean( df$Petal.Width, na.rm = TRUE)
[1] 1.218367
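Before removing missing values, it is often worth checking how many there are and where they sit. A minimal sketch using is.na(), rebuilding the same df as above so the block stands alone:

```r
df <- iris[, 4:5]
df$Petal.Width[c(2, 6, 12)] <- NA

# How many values are missing, and in which rows?
n_missing <- sum(is.na(df$Petal.Width))
where <- which(is.na(df$Petal.Width))
n_missing
where

# Keeping only the complete rows gives the same mean as na.rm = TRUE.
complete <- df[!is.na(df$Petal.Width), ]
nrow(complete)
mean(complete$Petal.Width)
```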

Adding Optionals to by()

You’ll have to do the same thing when using by()

by( df$Petal.Width, df$Species, mean, na.rm=TRUE )
df$Species: setosa
[1] 0.2446809
------------------------------------------------------------ 
df$Species: versicolor
[1] 1.326
------------------------------------------------------------ 
df$Species: virginica
[1] 2.026

Workflows & Pipes

Hypothetical Workflow

A common workflow consists of taking some data and performing several operations on it before we do some kind of analysis, summary, plot, or table. It can look like this:

data <- read_csv( file )
func1( data ) -> data1
data2 <- func2( data1 )
data3 <- func3( data2 )
data4 <- func4( data3 )

This duplicates data across the intermediate steps and requires extra typing. Remember, we strive for minimal effort!

 

The Treachery of Images

The Pipe Operator

In R we use this grammar:

data %>% Y()

This takes the values in data and passes them as the first argument to the function Y().

These pipes can be chained together into a single operation.

data %>%
  func1() %>%
  func2() %>%
  func3() -> newData
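To see that a pipe chain is just a rearranged nested call, here is a small runnable sketch. It assumes R 4.1 or later and uses the built-in native pipe |> so that no package is needed; for simple first-argument calls like these it behaves like %>%. The vector values is invented for illustration.

```r
values <- c(1, 4, 9)

# Nested form: read from the inside out.
nested <- sum(sqrt(values))

# Piped form: read left to right, each result feeding the next call.
piped <- values |> sqrt() |> sum()

identical(nested, piped)
```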

Tidyverse

The magrittr library, which provides the pipe operator, is part of the tidyverse group of packages, so it is usually easiest to just load the whole tidyverse.

library( tidyverse )

Example

Here is an operation that we’ve used as summary( iris ) in the past, but it can be used in a pipe like this.

iris %>% summary() 
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Tables
knitr + table -> kable

Let’s Make a Table!

The knitr library has some nice basic functionality to make tables.

library( knitr )

Make Summary Data Frame

The table should have the species names and the average length and width of sepals.

Make a new data frame and set the first column as species.

df <- data.frame( Species = levels( iris$Species ) )
df
     Species
1     setosa
2 versicolor
3  virginica

Make Summary Data Frame

Use the by() function to estimate mean length and width

df$Length <- by( iris$Sepal.Length, iris$Species, mean )
df$Width <- by( iris$Sepal.Width, iris$Species, mean )
df
     Species Length Width
1     setosa  5.006 3.428
2 versicolor  5.936 2.770
3  virginica  6.588 2.974
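The columns built with by() hold objects of class "by", which print fine as shown. If plain numeric columns are preferred for later steps, one possible alternative is aggregate(), sketched below; the name means and the renamed columns are choices made here for illustration.

```r
# Mean of both sepal measurements per species, in one call.
means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                   data = iris, FUN = mean)

# Rename to match the column names used in the slides.
names(means) <- c("Species", "Length", "Width")
means

# Length and Width are ordinary numeric columns.
sapply(means, class)
```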

Making A Table

So, now we’ll use our new pipe operator to pass the data into the kable() function (n.b., look at ?kable and see that the first argument is the data, which is being substituted by the pipe).

df %>%
  kable() 
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table captions

df %>%
  kable( caption = "Sepal size for three species of Fisher's iris data.")
Sepal size for three species of Fisher’s iris data.
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974
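Besides caption, kable() accepts arguments such as digits and col.names to control rounding and header text. The sketch below assumes the knitr package is installed and rebuilds the summary values by hand so it stands alone; the header labels are invented for illustration.

```r
library(knitr)

# The summary values from the slides, entered directly.
df <- data.frame(
  Species = c("setosa", "versicolor", "virginica"),
  Length  = c(5.006, 5.936, 6.588),
  Width   = c(3.428, 2.770, 2.974)
)

# Round to two decimals and give the columns more descriptive headers.
tab <- kable(
  df,
  digits    = 2,
  col.names = c("Species", "Sepal Length", "Sepal Width"),
  caption   = "Sepal size for three species of Fisher's iris data."
)
tab
```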

Making More Fancy Tables

The library kableExtra has a lot more functionality that can be added to the table.

library( kableExtra )

Table Themes

df %>%
  kable() %>%
  kable_paper()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Themes

df %>%
  kable() %>%
  kable_classic()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Themes

df %>%
  kable() %>%
  kable_classic_2()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Themes

df %>%
  kable() %>%
  kable_minimal()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Themes

df %>%
  kable() %>%
  kable_material()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Themes

df %>%
  kable() %>%
  kable_material_dark()
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Sizes

df %>%
  kable() %>%
  kable_paper( full_width=FALSE)
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Table Positions

df %>%
  kable() %>%
  kable_paper( full_width=FALSE, position="right")
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

position = "float_right"

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit libero sit amet porta elementum. In imperdiet tellus non odio porttitor auctor ac sit amet diam. Suspendisse eleifend vel nisi nec efficitur. Ut varius urna lectus, ac iaculis velit bibendum eget. Curabitur dignissim magna eu odio sagittis blandit.

Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Vivamus sed ipsum mi. Etiam est leo, mollis ultrices dolor eget, consectetur euismod augue. In hac habitasse platea dictumst. Integer blandit ante magna, quis volutpat velit varius hendrerit. Vestibulum sit amet lacinia magna. Sed at varius nisl. Donec eu porta tellus, vitae rhoncus velit.

Fancier Grouped Headers

df %>% 
  kable() %>% 
  kable_paper( full_width=FALSE) %>%
  add_header_above( c(" "=1, "Size (cm)" = 2))
              Size (cm)
Species     Length  Width
setosa       5.006  3.428
versicolor   5.936  2.770
virginica    6.588  2.974

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.
